The XML Framework and Its Implications for Corpus Access and Use

Author

  • Nancy Ide
Abstract

The eXtensible Markup Language (XML) (Bray, et al., 1998) is the emerging standard for data representation and exchange on the World Wide Web. The XML Framework includes very powerful mechanisms for accessing and manipulating XML documents that are likely to significantly impact the way annotated corpora are created and accessed. This paper outlines a few of the possibilities.

Introduction

The eXtensible Markup Language (XML) (Bray, et al., 1998) is the emerging standard for data representation and exchange on the World Wide Web. At its most basic level XML is a document markup language directly derived from SGML (i.e., allowing tagged text (elements), element nesting, and element references). As such, translation of an SGML-encoded document into XML is relatively trivial. However, various features and extensions of XML make it a far more powerful tool for data representation and access than SGML. The following outlines some of these mechanisms and suggests ways in which they can be used for creation and exploitation of annotated corpora.

XML links

The recommended practice in encoding annotated corpora is to maintain all or most annotations in separate documents, each of which references appropriate locations in the document containing the original data (Ide & Brew, 2000). This strategy yields, in essence, a finely linked hypertext format where the links specify a semantic role rather than navigational options. That is, links signify the location(s) where markup contained in a given annotation document would appear in the document to which it is linked. As such, annotation information comprises remote or "stand-off" markup that is virtually added to the document to which it is linked. In principle, the original data could contain no markup at all (or, more likely, markup for gross logical structure only); all markup could be retained in separate documents with links into the original based on offsets.
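The offset-based stand-off arrangement described above can be illustrated with a minimal sketch. It assumes a hypothetical annotation document whose `tok` elements carry `from` and `to` character offsets and a `pos` label; these names are invented for illustration and are not taken from the paper or from any XML linking standard. The primary text carries no markup, and the annotations are resolved against it by slicing.

```python
import xml.etree.ElementTree as ET

# Primary data: plain text with no markup at all.
PRIMARY_TEXT = "Thus one would expect that the whole sky would be as bright as the sun."

# Hypothetical stand-off annotation document. The element name "tok" and the
# attributes "from", "to", and "pos" are assumptions made for this sketch.
# Offsets are 0-based, with "to" exclusive.
ANNOTATION_XML = """
<annotations>
  <tok from="0" to="4" pos="ADV"/>
  <tok from="37" to="40" pos="NOUN"/>
</annotations>
"""


def resolve_annotations(text, annotation_xml):
    """Return (label, text span) pairs by applying offsets to the primary text."""
    root = ET.fromstring(annotation_xml)
    resolved = []
    for tok in root.findall("tok"):
        start = int(tok.get("from"))
        end = int(tok.get("to"))
        resolved.append((tok.get("pos"), text[start:end]))
    return resolved


resolved = resolve_annotations(PRIMARY_TEXT, ANNOTATION_XML)
for pos, span in resolved:
    print(pos, repr(span))
```

Because the primary text is never modified, any number of such annotation documents could point into the same data without interfering with one another.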
The standoff scheme, then, requires addressing XML elements, as well as characters and chains of characters within those elements. It also requires that elements and characters be addressable both within the same document and in other XML documents. XML provides the following linking mechanisms, which satisfy these requirements and are substantially more powerful than the mechanisms provided in SGML:

  • XLink (DeRose, et al., 2000), a mechanism for specifying a link (uni-directional or more complex linking structures) between two or more resources or portions of resources;
  • the XML Path Language (XPath) (Clark & DeRose, 1999), an extended addressing syntax that defines a concise notation for element localization in the document tree (as defined by the nesting of elements in the document itself), and allows addressing text fragments within a particular element by providing predicates for manipulating chains of characters;
  • XPointer (DeRose, Daniel, & Maler, 1999), which extends XPath syntax to allow addressing points and ranges as well as nodes, locating information by string matching, and use of addressing expressions in URI-references as fragment identifiers.

For example, the XPath expression /div/p[2]/s[3] specifies the third <s> (sentence) element within the second <p> (paragraph) element within each <div> (text division) element; /descendant::p specifies all <p> elements in the document. XPath's string predicates address text fragments within a particular element: the expression substring(/p/s[2]/text(),6) selects the string "one would expect that the whole sky would be as bright as the sun, even at night." from the following text:

<p><s>The difficulty is that in an infinite static universe nearly every line of sight would end on the surface of a star.</s><s>Thus one would expect that the whole sky would be as bright as the sun, even at night.</s></p>
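The XPath examples above can be tried out with a short sketch in Python. Note the assumptions: the standard-library `xml.etree.ElementTree` module implements only a subset of XPath, so the absolute paths are written relative to the root element (`./p[2]/s[3]` and `.//p`), and the `substring()` function is emulated with a small hypothetical helper, `xpath_substring`, rather than evaluated by an XPath engine. The sample `<div>` document is invented for illustration; the `<p>` document reuses the two sentences quoted above.

```python
import xml.etree.ElementTree as ET

# Invented sample document: a <div> holding two paragraphs of sentences.
DIV_XML = (
    "<div>"
    "<p><s>A1.</s><s>A2.</s></p>"
    "<p><s>B1.</s><s>B2.</s><s>B3.</s></p>"
    "</div>"
)

# The two sentences from the text, wrapped in <p> and <s> elements.
P_XML = (
    "<p>"
    "<s>The difficulty is that in an infinite static universe nearly every "
    "line of sight would end on the surface of a star.</s>"
    "<s>Thus one would expect that the whole sky would be as bright as the "
    "sun, even at night.</s>"
    "</p>"
)


def xpath_substring(value, start):
    """Emulate XPath substring(value, start) for integer start >= 1.

    XPath string positions are 1-based, Python indices are 0-based.
    """
    return value[start - 1:]


div = ET.fromstring(DIV_XML)
p = ET.fromstring(P_XML)

# Counterpart of /div/p[2]/s[3], evaluated relative to the <div> root.
third = [s.text for s in div.findall("./p[2]/s[3]")]

# Counterpart of /descendant::p, evaluated relative to the root.
paragraph_count = len(div.findall(".//p"))

# Counterpart of substring(/p/s[2]/text(), 6).
fragment = xpath_substring(p.find("./s[2]").text, 6)

print(third)
print(paragraph_count)
print(fragment)
```

A full XPath 1.0 processor would evaluate the original expressions directly; the relative paths and the slicing helper are used here only to keep the sketch within the standard library.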

Publication date: 2000